OPR: Partial Deterministic Record and Replay for One-Sided Communication
Abstract
Deterministic replay of parallel execution and communication operations is required both by HPC debuggers and resilience mechanisms. Despite its potential performance advantages, the inherent nondeterminism present in one-sided communication makes replaying difficult. The essential problem is that the readers of updated shared data do not have any information on which remote threads produced the updates. This paper presents OPR (One-sided communication Partial Record and Replay), the first known software tool for record and deterministic replay for one-sided communication. We have designed OPR from first principles with scalability as its main goal. OPR allows the user to specify a set of tasks of interest and then “records” their execution. The tasks in this set can be replayed, while any other task from the original execution can be abstracted away. OPR provides determinism by using a combination of data- and order-replay. To ensure scalability with the value and the order logs, we carefully optimize the recording stage: values are logged on the first read or only when changed; ordering is imprecisely maintained using a tailored vector clock algorithm. Our evaluation on deterministic and non-deterministic UPC programs shows that OPR introduced an overhead ranging from 1.3× to 27×, when running on 1,024 cores and tracking up
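The value-logging idea mentioned in the abstract (log a value on the first read, or only when it has changed) can be illustrated with a minimal sketch. This is not OPR's implementation: the class names `ValueRecorder` and `ValueReplayer`, the per-task read counter, and the assumption that the recorded task observes shared data only through an `on_read` hook are all hypothetical choices made for illustration.

```python
class ValueRecorder:
    """Hypothetical recorder: logs a read only on the first access to an
    address, or when the observed value differs from the last logged one."""

    def __init__(self):
        self.last_logged = {}  # address -> value most recently written to the log
        self.log = []          # entries of (read_index, address, value)
        self.reads = 0         # number of reads the recorded task has issued

    def on_read(self, address, value):
        index = self.reads
        self.reads += 1
        first_read = address not in self.last_logged
        if first_read or self.last_logged[address] != value:
            self.last_logged[address] = value
            self.log.append((index, address, value))
        return value


class ValueReplayer:
    """Hypothetical replayer: serves reads from the log, so the remote
    writers from the original run do not need to execute."""

    def __init__(self, log):
        self.log = list(log)
        self.pos = 0           # next unconsumed log entry
        self.current = {}      # address -> value currently visible to the task
        self.reads = 0

    def on_read(self, address):
        index = self.reads
        self.reads += 1
        if self.pos < len(self.log) and self.log[self.pos][0] == index:
            _, logged_address, value = self.log[self.pos]
            if logged_address != address:
                raise RuntimeError("replay diverged from the recorded run")
            self.current[address] = value
            self.pos += 1
        return self.current[address]


# Usage: six reads in the recorded run produce only three log entries.
trace = [("x", 1), ("x", 1), ("y", 5), ("x", 2), ("x", 2), ("y", 5)]
recorder = ValueRecorder()
for addr, val in trace:
    recorder.on_read(addr, val)

replayer = ValueReplayer(recorder.log)
replayed = [replayer.on_read(addr) for addr, _ in trace]
```

In this sketch, repeated reads of an unchanged value add nothing to the log, which is the property the abstract relies on to keep the value log small. The order-replay half (the tailored vector clock) is not shown, since the abstract does not describe how it is tailored.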
Similar resources
Efficient Deterministic Replay Using Complete Race Detection
Data races can significantly affect the executions of multi-threaded programs. Hence, one has to recur the results of data races to deterministically replay a multi-threaded program. However, data races are concealed in enormous number of memory operations in a program. Due to the difficulty of accurately identifying data races, previous multi-threaded deterministic record/replay schemes for co...
Deterministic Process Groups in dOS
Current multiprocessor systems execute parallel and concurrent software nondeterministically: even when given precisely the same input, two executions of the same program may produce different output. This severely complicates debugging, testing, and automatic replication for fault-tolerance. Previous efforts to address this issue have focused primarily on record and replay, but making executio...
Implementing record and refinement for debugging timing-dependent communication
Distributed applications are hard to debug because timing-dependent network communication is a source of non-deterministic behavior. Current approaches to debug non-deterministic failures include post-mortem debugging as well as record and replay. However, the first impairs system performance to gather data, whereas the latter requires developers to understand the timing-dependent communication...
LEAP: The Lightweight Deterministic Multi-processor Replay of Concurrent Java Programs
The technique of deterministic record and replay aims at faithfully reenacting an earlier program execution. For concurrent programs, it is one of the most important techniques for program understanding and debugging. The state of the art deterministic replay techniques face challenging efficiency problems in supporting multi-processor executions due to the unoptimized treatment of shared memor...
Efficient Deterministic Replay of Multithreaded Programs Based on Efficient Tracking of Cross-Thread Dependences
Shared-memory parallel programs are inherently nondeterministic, making it difficult to diagnose rare bugs and to achieve deterministic execution, e.g., for replication. Existing multithreaded record & replay approaches have serious limitations such as relying on custom hardware or slowing programs by an order of magnitude. This paper introduces an approach for multithreaded record & replay bas...
Publication date: 2015